Titanic Wrangling

In this practice activity you’ll continue to work with the Titanic dataset in ways that flex what you’ve learned about both data wrangling and data visualization.

import pandas as pd
import numpy as np
import plotly.express as px

data_dir = "https://dlsun.github.io/pods/data/"
df_titanic = pd.read_csv(data_dir + "titanic.csv")

# Keep only rows that have class & embarked info
# (and, if class is missing but pclass exists, construct class)
df = df_titanic.copy()
if "class" not in df.columns and "pclass" in df.columns:
    _map = {1: "First", 2: "Second", 3: "Third"}
    df["class"] = df["pclass"].map(_map)

df = df.dropna(subset=["class", "embarked"])
df.head()
                             name  gender   age  class  embarked        country  ticketno   fare  survived
0             Abbing, Mr. Anthony    male  42.0    3rd         S  United States    5547.0   7.11         0
1       Abbott, Mr. Eugene Joseph    male  13.0    3rd         S  United States    2673.0  20.05         0
2     Abbott, Mr. Rossmore Edward    male  16.0    3rd         S  United States    2673.0  20.05         0
3  Abbott, Mrs. Rhoda Mary 'Rosa'  female  39.0    3rd         S        England    2673.0  20.05         1
4     Abelseth, Miss. Karen Marie  female  16.0    3rd         S         Norway  348125.0   7.13         1

1. Filter the data to include passengers only. Calculate the joint distribution (cross-tab) between a passenger’s class and where they embarked.

# Keep passengers only: the class column also holds crew categories
df_pax = df[df["class"].isin(["1st", "2nd", "3rd"])]
joint = pd.crosstab(df_pax["class"], df_pax["embarked"])
joint
embarked  B    C    Q    S
class
1st       3  143    3  175
2nd       6   26    7  245
3rd       0  102  113  494
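
The cross-tab above holds counts, while a joint distribution in the strict sense is the table of proportions that sums to 1. Here is a minimal sketch of that conversion, assuming the passenger-only counts reported in this activity are typed in by hand (the names `counts` and `joint_prop` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Passenger-only counts copied from this activity's cross-tab (crew excluded).
counts = pd.DataFrame(
    {"B": [3, 6, 0], "C": [143, 26, 102], "Q": [3, 7, 113], "S": [175, 245, 494]},
    index=pd.Index(["1st", "2nd", "3rd"], name="class"),
)
counts.columns.name = "embarked"

# Divide every cell by the grand total to turn counts into joint proportions.
joint_prop = counts / counts.to_numpy().sum()

print(joint_prop.round(3))
```

With the full data, passing `normalize=True` to `pd.crosstab` should give the same kind of table in one step.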

2. Using the joint distribution that you calculated above, calculate the following:

  • the conditional distribution of their class given where they embarked
  • the conditional distribution of where they embarked given their class

Use the conditional distributions that you calculate to answer the following questions:

  • What proportion of 3rd class passengers embarked at Southampton?
  • What proportion of Southampton passengers were in 3rd class?

# Restrict to passengers: crew categories are excluded
passenger_classes = ["1st", "2nd", "3rd"]
df_pax = df[df["class"].isin(passenger_classes)].copy()

# Joint counts of class by embarkation port
joint = pd.crosstab(df_pax["class"], df_pax["embarked"])
display(joint)

# P(class | embarked): each column sums to 1
cond_class_given_embarked = pd.crosstab(
    df_pax["class"], df_pax["embarked"], normalize="columns"
)
# P(embarked | class): each row sums to 1
cond_embarked_given_class = pd.crosstab(
    df_pax["class"], df_pax["embarked"], normalize="index"
)
display(cond_class_given_embarked)
display(cond_embarked_given_class)

# Of 3rd-class passengers, the share who boarded at Southampton
prop_S_given_3rd = cond_embarked_given_class.loc["3rd", "S"]
# Of Southampton passengers, the share who were in 3rd class
prop_3rd_given_S = cond_class_given_embarked.loc["3rd", "S"]

print(f"P(S | 3rd) = {prop_S_given_3rd:.3f}")
print(f"P(3rd | S) = {prop_3rd_given_S:.3f}")
embarked  B    C    Q    S
class
1st       3  143    3  175
2nd       6   26    7  245
3rd       0  102  113  494

embarked         B         C         Q         S
class
1st       0.333333  0.527675  0.024390  0.191466
2nd       0.666667  0.095941  0.056911  0.268053
3rd       0.000000  0.376384  0.918699  0.540481

embarked         B         C         Q         S
class
1st       0.009259  0.441358  0.009259  0.540123
2nd       0.021127  0.091549  0.024648  0.862676
3rd       0.000000  0.143865  0.159379  0.696756
P(S | 3rd) = 0.697
P(3rd | S) = 0.540

Most 3rd-class passengers (≈70%) embarked at Southampton, and about 54% of all Southampton passengers were 3rd class. So 3rd class mainly boarded at Southampton and was the largest group boarding there, while 1st class made up the largest share of Cherbourg passengers (≈53%).
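
The two conditional distributions can also be derived directly from the joint counts by dividing by the appropriate marginal totals, which is closer to what the question asks for. Below is a minimal sketch, again assuming the passenger-only counts from this activity are entered by hand. The variable names are illustrative, and the Bayes' rule check at the end is an added sanity check rather than part of the original answer.

```python
import pandas as pd

# Passenger-only counts copied from this activity's cross-tab (crew excluded).
counts = pd.DataFrame(
    {"B": [3, 6, 0], "C": [143, 26, 102], "Q": [3, 7, 113], "S": [175, 245, 494]},
    index=pd.Index(["1st", "2nd", "3rd"], name="class"),
)
counts.columns.name = "embarked"

# P(class | embarked): divide each column by its column total.
class_given_embarked = counts.div(counts.sum(axis=0), axis=1)

# P(embarked | class): divide each row by its row total.
embarked_given_class = counts.div(counts.sum(axis=1), axis=0)

p_3rd_given_S = class_given_embarked.loc["3rd", "S"]
p_S_given_3rd = embarked_given_class.loc["3rd", "S"]

# Bayes' rule links the two: P(S | 3rd) = P(3rd | S) * P(S) / P(3rd).
total = counts.to_numpy().sum()
p_S = counts["S"].sum() / total
p_3rd = counts.loc["3rd"].sum() / total
bayes_check = p_3rd_given_S * p_S / p_3rd

print(f"P(3rd | S) = {p_3rd_given_S:.3f}")
print(f"P(S | 3rd) = {p_S_given_3rd:.3f}")
print(f"Bayes check for P(S | 3rd) = {bayes_check:.3f}")
```

The two probabilities differ because they divide the same count (494) by different totals: all Southampton passengers in one case, all 3rd-class passengers in the other.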

3. Make a visualization showing the distribution of a passenger’s class, given where they embarked.

Discuss the pros and cons of using this visualization versus the distributions you calculated before, to answer the previous questions.



viz_df = (
    cond_class_given_embarked
    .reset_index()
    .melt(id_vars="class", var_name="embarked", value_name="proportion")
)


fig = px.bar(
    viz_df,
    x="embarked",
    y="proportion",
    color="class",
    barmode="group",
    text=viz_df["proportion"].map(lambda x: f"{x:.2f}")
)
fig.update_layout(
    title="Distribution of Passenger Class, Given Where They Embarked (P(class | embarked))",
    yaxis=dict(title="Proportion", tickformat=".0%"),
    xaxis_title="Embarkation Port"
)
fig.show()

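Pros: it is easy to compare class proportions within each embarkation port at a glance, which is much clearer visually than scanning a table.

Cons: exact values are harder to read than in the table, and it is less obvious that the bars for each port sum to 1. The chart also shows only P(class | embarked), so it can answer the second question (the share of Southampton passengers in 3rd class) but not the first.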